This data is RNA-seq data from monocytes and CD4 T cells from healthy control patients, patients with relapsing-remitting MS (RRMS) and patients with secondary progessive MS (SPMS). The monocytes and CD4 T cells are taken from the same sample. Each sample was run twice as technical replicates. The affected were age matched with the controls, so there are 3 patients that should be analyzed together if possible in each group. They would like to look at all pairwise comparisons with the controls, RRMS and SPMS samples within each cell type. They are not interested in looking across the different cell types.
The data is RNA-seq 3’DGE data– it appears as if it is SCRB-seq data to me or something similar to that. The table of counts have been provided, we have the total counts and the UMI deduped counts; we’ll be using the deduped counts provided by the Broad.
This work is for the Gandhi lab, and Maria Mazzola is the person leading the project.
First we’ll read in the count data generated by the Broad and convert the column names to the well numbers:
> library(readr)
> library(dplyr)
> library(tidyr)
> library(ggplot2)
> library(cowplot)
> library(viridis)
> counts = read_tsv("../data/SSF-1505_plate_GTAGAGGA.unq.refseq.umi.dat") %>%
+ as.data.frame()
> rownames(counts) = counts[, 1]
> counts[, 1] = NULL
> cnames = unlist(lapply(strsplit(colnames(counts), split = "_"), tail, n = 1))
> colnames(counts) = cnames
> head(counts)
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 B1 B2 B3 B4 B5 B6 B7 B8 B9
A1BG 9 13 1 0 4 6 4 5 6 10 3 5 12 7 0 2 21 10 6 4 10
A1BG-AS1 2 0 1 1 1 1 0 0 0 0 0 0 1 1 0 0 3 2 0 0 0
A1CF 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A2M 3 0 4 3 1 0 3 3 0 0 8 11 0 0 0 2 1 1 2 0 0
A2M-AS1 0 2 5 5 2 0 2 2 0 0 4 5 2 0 2 2 0 1 1 0 1
A2ML1 5 5 0 3 3 4 1 2 4 3 3 2 6 3 3 3 5 2 3 1 3
B10 B11 B12 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 D1 D2 D3 D4 D5
A1BG 11 4 3 8 4 5 6 5 3 8 3 9 9 1 4 4 6 2 3 6
A1BG-AS1 0 0 0 2 1 0 2 0 1 1 2 0 1 1 0 0 1 0 1 0
A1CF 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A2M 1 4 3 0 0 0 1 1 0 1 0 1 0 5 1 0 0 1 4 0
A2M-AS1 1 4 1 0 1 0 1 3 4 3 0 0 0 1 3 2 3 2 0 0
A2ML1 3 2 2 2 0 2 4 9 5 1 2 4 2 6 3 2 4 0 2 7
D6 D7 D8 D9 D10 D11 D12 E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 E11 E12 F1
A1BG 6 3 3 2 4 5 4 1 3 4 6 8 6 4 6 8 5 3 4 4
A1BG-AS1 1 0 0 0 0 1 0 0 2 0 0 0 1 2 0 0 0 4 0 0
A1CF 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A2M 0 2 7 0 2 2 1 0 0 3 0 0 1 1 4 0 1 5 4 0
A2M-AS1 1 2 4 0 2 3 2 0 2 3 3 3 2 3 2 0 0 1 7 0
A2ML1 2 2 1 4 0 4 0 0 6 4 1 3 2 1 3 3 1 1 2 3
F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 G1 G2 G3 G4 G5 G6 G7 G8 G9
A1BG 10 5 2 3 8 5 4 9 8 2 0 3 6 2 2 3 4 2 2 4
A1BG-AS1 0 0 2 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
A1CF 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A2M 0 1 1 0 0 0 3 0 0 5 4 1 0 1 2 0 0 4 7 0
A2M-AS1 1 2 3 1 1 3 2 0 2 1 2 2 0 1 2 0 2 3 2 0
A2ML1 4 2 3 4 1 1 2 1 1 1 3 0 4 1 1 5 10 1 2 4
G10 G11 G12 H1 H2 H3 H4 H5 H6 H7 H8 H9 H10 H11 H12
A1BG 0 1 1 10 5 5 3 6 3 0 6 3 4 4 3
A1BG-AS1 0 0 0 0 1 0 0 1 0 0 2 1 3 1 0
A1CF 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A2M 1 1 0 0 0 0 1 0 1 1 4 0 0 2 1
A2M-AS1 1 0 1 0 1 3 0 0 0 0 0 0 0 1 1
A2ML1 5 1 1 4 2 1 0 1 2 1 1 4 2 0 3
We’ll also read in the CSV of metadata and place some of the information into more specific columns. We added columns for phenotype, group, celltype and sample to make it easier to work with the samples later.
> metadata = read_csv("../metadata/sample-info.csv") %>% tidyr::separate(samplename,
+ c("sample", "celltype"), sep = "_")
> metadata$phenotype = substr(metadata$sample, 1, 2)
> metadata$group = as.factor(substr(metadata$sample, 3, 3))
> head(metadata)
Source: local data frame [6 x 12]
cell sample celltype rin age gender sampled processed
(chr) (chr) (chr) (dbl) (int) (chr) (chr) (chr)
1 A1 HC1 mono 9.4 35 F 12.02.2014 01.27.2015
2 A3 HC1 Tcells 8.9 35 F 12.02.2014 01.27.2015
3 A5 RR1 mono 9.3 36 F 05.13.2013 02.13.2015
4 A7 RR1 Tcells 8.8 36 F 05.13.2013 02.13.2015
5 A9 SP1 mono 9.7 37 F 12.01.2011 02.10.2015
6 A11 SP1 Tcells 9.2 37 F 12.01.2011 02.10.2015
Variables not shown: viable_pct (dbl), edss (dbl), phenotype (chr), group
(fctr)
Now we’ll sanity check the metadata. Samples in the sample group should be all the same gender and should be similar in age. We’ll plot those to make sure I didn’t screw anything up.
> ggplot(metadata, aes(group, age, color = gender, size = edss)) + geom_jitter() +
+ theme(text = element_text(family = "Gill Sans")) + scale_color_viridis(discrete = TRUE)
Looks good.
The count data has the correct number of columns 96 and a reasonable number of rows 23895. There are no missing values for the counts or anything like that so that data looks good to go as well.
Here we calculate a bunch of sumary statistics at the level of the cell.
> cellstats = data.frame(cell = colnames(counts), total_counts = colSums(counts),
+ genes_detected = colSums(counts > 0), robust_genes_detected = colSums(counts >
+ 10)) %>% left_join(metadata, by = "cell")
Overall the total counts per cell has a pretty even distribution, hovering around 0.5 - 1 million counts per cell. Groups 7 and 8 seem to have a consistently lower counts per cell than the rest of the samples. Usually I colored the bars by rin number, age, gender and viable_pct (not shown) but none of those seemed to correspond to the lower number of counts.
> ggplot(cellstats, aes(cell, total_counts)) + geom_bar(stat = "identity") + facet_wrap(~group,
+ scales = "free_x") + theme(axis.text.x = element_text(angle = 90, hjust = 1,
+ size = 8), text = element_text(family = "Gill Sans")) + xlab("") + ylab("total counts")
The plots of genes detected looks pretty good, a nice even amount for all of the samples. The number of genes detected is high, around half of the queried genes per cell, which is good. Groups 7 and 8 tend to have a slightly lower number of genes detected than the other cells, which we expected due to having a lower number of counts.
> ggplot(cellstats, aes(cell, genes_detected)) + geom_bar(stat = "identity") +
+ facet_wrap(~group, scales = "free_x") + theme(axis.text.x = element_text(angle = 90,
+ hjust = 1, size = 8), text = element_text(family = "Gill Sans")) + xlab("") +
+ ylab("genes detected")
Here we plot genes detected vs the total number of counts. We see a clear relationship between the number of genes detected and the total number of counts, suggesting that we could detect even more genes if we sequenced these samples more deeply. This plot is mainly to detect if there are groups of failed cells that don’t seem to follow along the polynomial line we fit.
> ggplot(cellstats, aes(total_counts, genes_detected)) + geom_point() + geom_smooth(method = "lm",
+ formula = y ~ poly(x, 2), size = 1) + theme(text = element_text(family = "Gill Sans")) +
+ xlab("total counts") + ylab("genes detected")
Here we run principal component analysis (PCA) on the count data. Before performing PCA we center and scale the data and correct for the mean-variance relationship in RNA-seq data by performing a variance stabilizing transformation on the data. This ensures that low expressed genes with high variance do not dominate the PCA plot.
After variance stabilizing the data we take the top 500 most variable genes and use them to perform PCA.
> library(DESeq2)
> design = ~group + celltype + phenotype + phenotype:celltype
> dds = DESeqDataSetFromMatrix(countData = counts, colData = metadata, design = design)
> vst = varianceStabilizingTransformation(dds)
> pca_loadings = function(object, ntop = 500) {
+ rv <- matrixStats::rowVars(assay(object))
+ select <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
+ pca <- prcomp(t(assay(object)[select, ]))
+ percentVar <- pca$sdev^2/sum(pca$sdev^2)
+ names(percentVar) = colnames(pca$x)
+ pca$percentVar = percentVar
+ return(pca)
+ }
> pc = pca_loadings(vst)
> comps = data.frame(pc$x)
> comps$Name = rownames(comps)
> library(dplyr)
> comps = comps %>% left_join(metadata, by = c(Name = "cell"))
> pca_plot = function(comps, nc1, nc2, colorby) {
+ c1str = paste0("PC", nc1)
+ c2str = paste0("PC", nc2)
+ ggplot(comps, aes_string(c1str, c2str, color = colorby)) + geom_point() +
+ theme_bw() + xlab(paste0(c1str, ": ", round(pc$percentVar[nc1] * 100),
+ "% variance")) + ylab(paste0(c2str, ": ", round(pc$percentVar[nc2] *
+ 100), "% variance"))
+ }
We can see a great separation by cell type along PC1:
> pca_plot(comps, 1, 2, "celltype")
We see plotting PC2 vs PC3 a more subtle difference between phenotypes. RR and MS look mixed but there seems to be a separation between the control and the affected samples.
> pca_plot(comps, 2, 3, "phenotype")
Part of the reason for the poor higher component separation is we are using the 500 genes that are dominated by cell type. If we first separate out the cell types and then do the PCA, we might see a better separation based on phenotype.
> monometa = subset(metadata, celltype == "mono")
> monocounts = counts[, monometa$cell]
> tcellmeta = subset(metadata, celltype == "Tcells")
> tcellcounts = counts[, tcellmeta$cell]
> design = ~group + phenotype
> monodds = DESeqDataSetFromMatrix(countData = monocounts, colData = monometa,
+ design = design)
> tcelldds = DESeqDataSetFromMatrix(countData = tcellcounts, colData = tcellmeta,
+ design = design)
> monovst = varianceStabilizingTransformation(monodds)
> tcellvst = varianceStabilizingTransformation(tcelldds)
> monopc = pca_loadings(monovst)
> tcellpc = pca_loadings(tcellvst)
> monocomps = data.frame(monopc$x)
> tcellcomps = data.frame(tcellpc$x)
> monocomps$cell = rownames(monocomps)
> tcellcomps$cell = rownames(tcellcomps)
> monocomps = monocomps %>% left_join(metadata, by = c(cell = "cell"))
> tcellcomps = tcellcomps %>% left_join(metadata, by = c(cell = "cell"))
We can see a separation along component 2 with which gender the samples are from:
> pca_plot(tcellcomps, 1, 2, "gender")
But component 1 is a little bit of a mystery, it does kind of separate by phenotype, all of the HC samples are clustered on the left and the right is more mixed MS/RR.
> pca_plot(tcellcomps, 1, 2, "phenotype")
We don’t see the same gender based separation in the monocytes that we saw with the Tcells.
There isn’t a separation based on gender in PC1-PC2, it isn’t clear what is separating these cells because they also don’t separate by phenotype:
> pca_plot(monocomps, 1, 2, "gender")
> pca_plot(monocomps, 1, 2, "phenotype")
Samples do separate along PC3 and PC4 by gender. So for T-cells there seems to be a larger gender effect than the monocytes.
> pca_plot(monocomps, 3, 4, "gender")
> pca_plot(monocomps, 3, 4, "phenotype")
The cells all look good so we don’t need to do any filtering of the individual cells. We’ll drop all genes that don’t have any counts in any cells, and then combine the counts for all of the technical replicate cells by adding them together before doing differential expression.
> monomelt = monocounts %>% add_rownames() %>% tidyr::gather(cell, count, -rowname) %>%
+ left_join(monometa, by = "cell") %>% select(rowname, sample, count) %>%
+ group_by(rowname, sample) %>% summarise(total = sum(count)) %>% tidyr::spread(sample,
+ total) %>% as.data.frame()
> rownames(monomelt) = monomelt$rowname
> monomelt$rowname = NULL
> monomelt = monomelt[rowSums(monomelt) > 0, ]
> monometa = monometa %>% group_by(sample) %>% filter(row_number() == 1) %>% select(-cell)
> monometa = as.data.frame(monometa)
> rownames(monometa) = monometa$sample
> monomelt = monomelt[, rownames(monometa)]
>
> tcellmelt = tcellcounts %>% add_rownames() %>% tidyr::gather(cell, count, -rowname) %>%
+ left_join(tcellmeta, by = "cell") %>% select(rowname, sample, count) %>%
+ group_by(rowname, sample) %>% summarise(total = sum(count)) %>% tidyr::spread(sample,
+ total) %>% as.data.frame()
> rownames(tcellmelt) = tcellmelt$rowname
> tcellmelt$rowname = NULL
> tcellmelt = tcellmelt[rowSums(tcellmelt) > 0, ]
> tcellmeta = tcellmeta %>% group_by(sample) %>% filter(row_number() == 1) %>%
+ select(-cell)
> tcellmeta = as.data.frame(tcellmeta)
> rownames(tcellmeta) = tcellmeta$sample
> tcellmelt = tcellmelt[, rownames(tcellmeta)]
We will fit the model to the two different celltypes separately. We fit a model of the form: ~group+phenotype where we compare only the cells from the same group to each other and look for differences in phenotype.
The dispersion plot looks awesome, in this plot the black dots are the estimates of the dispersion if you just considered each gene by itself, and the red line is the best fit of the dispersion for genes of a given expression value. The blue dots are the final value for the dispersions, which for outlier spots are pulled back towards the fitted line.
> design = ~group + phenotype
> monodds = DESeqDataSetFromMatrix(countData = monomelt, colData = monometa, design = design)
> monodds = estimateSizeFactors(monodds)
> monodds = DESeq(monodds)
> plotDispEsts(monodds)
> monorr_vs_hc = results(monodds, contrast = c("phenotype", "RR", "HC"))
> plotMA(monorr_vs_hc)
> monorr_vs_hc_sig = subset(monorr_vs_hc, padj < 0.1)
> monorr_vs_hc = monorr_vs_hc[order(monorr_vs_hc$padj), ]
> monorr_vs_hc = cbind(data.frame(gene = rownames(monorr_vs_hc)), monorr_vs_hc)
> write.table(monorr_vs_hc, file = "mono-rr-vs-hc.csv", row.names = FALSE, col.names = TRUE,
+ quote = FALSE, sep = ",")
We flag 26 genes different between RR and HC samples in the monocytes, using a FDR cutoff of 0.1. Of those 10 are up in the RR samples and 16 are down in the RR samples.
> monosp_vs_hc = results(monodds, contrast = c("phenotype", "SP", "HC"))
> monosp_vs_hc_sig = subset(monosp_vs_hc, padj < 0.1)
> plotMA(monosp_vs_hc)
> monosp_vs_hc = monosp_vs_hc[order(monosp_vs_hc$padj), ]
> monosp_vs_hc = cbind(data.frame(gene = rownames(monosp_vs_hc)), monosp_vs_hc)
> write.table(monosp_vs_hc, file = "mono-sp-vs-hc.csv", row.names = FALSE, col.names = TRUE,
+ quote = FALSE, sep = ",")
We flag 76 genes different between SP and HC samples in the monocytes, using a FDR cutoff of 0.1. Of those 35 are up in the SP samples and 41 are down in the SP samples.
> monosp_vs_rr = results(monodds, contrast = c("phenotype", "SP", "RR"))
> monosp_vs_rr_sig = subset(monosp_vs_rr, padj < 0.1)
> plotMA(monosp_vs_rr)
> monosp_vs_rr = monosp_vs_rr[order(monosp_vs_rr$padj), ]
> monosp_vs_rr = cbind(data.frame(gene = rownames(monosp_vs_rr)), monosp_vs_rr)
> write.table(monosp_vs_rr, file = "mono-sp-vs-rr.csv", row.names = FALSE, col.names = TRUE,
+ quote = FALSE, sep = ",")
We flag 1 genes different between SP and RR samples in the monocytes, using a FDR cutoff of 0.1. Of those 1 are up in the SP samples and 0 are down in the SP samples.
> design = ~group + phenotype
> tcelldds = DESeqDataSetFromMatrix(countData = tcellmelt, colData = tcellmeta,
+ design = design)
> tcelldds = estimateSizeFactors(tcelldds)
> tcelldds = DESeq(tcelldds)
> plotDispEsts(tcelldds)
> tcellrr_vs_hc = results(tcelldds, contrast = c("phenotype", "RR", "HC"))
> tcellrr_vs_hc_sig = subset(tcellrr_vs_hc, padj < 0.1)
> plotMA(tcellrr_vs_hc)
> tcellrr_vs_hc = tcellrr_vs_hc[order(tcellrr_vs_hc$padj), ]
> tcellrr_vs_hc = cbind(data.frame(gene = rownames(tcellrr_vs_hc)), tcellrr_vs_hc)
> write.table(tcellrr_vs_hc, file = "tcell-rr-vs-hc.csv", row.names = FALSE, col.names = TRUE,
+ quote = FALSE, sep = ",")
We flag 1 genes different between RR and HC samples in the T-cells, using a FDR cutoff of 0.1. Of those 0 are up in the RR samples and 1 are down in the RR samples.
> tcellsp_vs_hc = results(tcelldds, contrast = c("phenotype", "SP", "HC"))
> tcellsp_vs_hc_sig = subset(tcellsp_vs_hc, padj < 0.1)
> plotMA(tcellsp_vs_hc)
> tcellsp_vs_hc = tcellsp_vs_hc[order(tcellsp_vs_hc$padj), ]
> tcellsp_vs_hc = cbind(data.frame(gene = rownames(tcellsp_vs_hc)), tcellsp_vs_hc)
> write.table(tcellsp_vs_hc, file = "tcell-sp-vs-hc.csv", row.names = FALSE, col.names = TRUE,
+ quote = FALSE, sep = ",")
We flag 75 genes different between sp and HC samples in the T-cells, using a FDR cutoff of 0.1. Of those 37 are up in the SP samples and 38 are down in the SP samples.
> tcellsp_vs_rr = results(tcelldds, contrast = c("phenotype", "SP", "RR"))
> tcellsp_vs_rr_sig = subset(tcellsp_vs_rr, padj < 0.1)
> plotMA(tcellsp_vs_rr)
> tcellsp_vs_rr = tcellsp_vs_rr[order(tcellsp_vs_rr$padj), ]
> tcellsp_vs_rr = cbind(data.frame(gene = rownames(tcellsp_vs_rr)), tcellsp_vs_rr)
> write.table(tcellsp_vs_rr, file = "tcell-sp-vs-rr.csv", row.names = FALSE, col.names = TRUE,
+ quote = FALSE, sep = ",")
We flag 0 genes different between sp and RR samples in the T-cells, using a FDR cutoff of 0.1. Of those 0 are up in the SP samples and 0 are down in the SP samples.
> monometa$affected = ifelse(monometa$phenotype == "HC", "unaffected", "affected")
> design = ~group + affected
> monoaffecteddds = DESeqDataSetFromMatrix(countData = monomelt, colData = monometa,
+ design = design)
> monoaffecteddds = estimateSizeFactors(monoaffecteddds)
> monoaffecteddds = DESeq(monoaffecteddds)
> monoaffected = results(monoaffecteddds, contrast = c("affected", "affected",
+ "unaffected"))
> monoaffected_sig = subset(monoaffected, padj < 0.1)
> plotMA(monoaffected)
> monoaffected = monoaffected[order(monoaffected$padj), ]
> monoaffected = cbind(data.frame(gene = rownames(monoaffected)), monoaffected)
> write.table(monoaffected, file = "monoaffected.csv", row.names = FALSE, col.names = TRUE,
+ quote = FALSE, sep = ",")
We flag 155 genes different between affected and unaffected monocytes, using a FDR cutoff of 0.1. Of those 67 are up in the affected samples and 88 are down in the affected samples.
> tcellmeta$affected = ifelse(tcellmeta$phenotype == "HC", "unaffected", "affected")
> design = ~group + affected
> tcellaffecteddds = DESeqDataSetFromMatrix(countData = tcellmelt, colData = tcellmeta,
+ design = design)
> tcellaffecteddds = estimateSizeFactors(tcellaffecteddds)
> tcellaffecteddds = DESeq(tcellaffecteddds)
> tcellaffected = results(tcellaffecteddds, contrast = c("affected", "affected",
+ "unaffected"))
> tcellaffected_sig = subset(tcellaffected, padj < 0.1)
> plotMA(tcellaffected)
> tcellaffected = tcellaffected[order(tcellaffected$padj), ]
> tcellaffected = cbind(data.frame(gene = rownames(tcellaffected)), tcellaffected)
> write.table(tcellaffected, file = "tcellaffected.csv", row.names = FALSE, col.names = TRUE,
+ quote = FALSE, sep = ",")
We flag 61 genes different between affected and unaffected tcells, using a FDR cutoff of 0.1. Of those 29 are up in the affected samples and 32 are down in the affected samples.